1 Introduction

Predicting and managing urban gentrification is increasingly crucial for city planners and policymakers aiming to balance economic growth with community sustainability. This report introduces a predictive model for gentrification, focusing on its application to identifying at-risk areas within the City of Los Angeles. The project leverages spatial and census data from 2015 to 2020, integrating key metrics such as property values, demographic shifts, and local amenities to construct a robust model of urban transformation.

2 Data Collection

Our project tests a combination of spatial and non-spatial data to construct a comprehensive predictive model of gentrification in Los Angeles. The spatial data comprise the geographic boundaries of census tracts and the locations of amenities such as grocery stores, restaurants, educational facilities, and transit stations. The non-spatial data comprise demographic and socioeconomic variables derived from the American Community Survey (2015-2020), including population density, age distribution, ethnic composition, income levels, education levels, housing characteristics, and migration patterns.

2.1 Spatial Data

3 Measuring Gentrification

In our study, we adopted a refined approach to measuring gentrification, drawing upon the methodologies described in prior research, particularly from the National Community Reinvestment Coalition (NCRC) guidelines. Our primary aim was to identify census tracts that underwent significant socio-economic changes indicative of gentrification. To this end, we set criteria based on changes observed from 2015 to 2020, utilizing the American Community Survey (ACS) five-year data sets.

The criteria for defining a tract as gentrified were based on several key indicators:

Population Eligibility: Originally, only tracts with a population greater than 500 in 2014 were considered eligible for further analysis. To capture all information in the census tract data, we instead used all tracts regardless of population. To be included in further analysis, a tract also had to meet two criteria: a median home value below the 40th percentile and a median household income below the 40th percentile.

Socioeconomic Indicators: We assessed changes in median home values, median household income, and the percentage of residents with a college education. A tract was considered at risk of gentrification if it experienced:

  • An increase in median home value and median household income greater than the 40th percentile citywide, rather than the 60th percentile outlined in the original NCRC model. This adjustment captures more subtle yet significant shifts that may not reach the higher threshold but still represent meaningful change.

  • An increase in the proportion of college-educated residents, also above the 40th percentile, to reflect educational gentrification.

Metropolitan-Level Comparison: All data were loaded, calculated, and filtered at the metropolitan scale rather than being trimmed to the city boundary at the outset. This makes the identification of gentrification more accurate because it reflects gentrification trends in the surrounding tracts.

Tracts meeting these criteria, based on 2020 census information, were classified as ‘gentrified’ (1) and all others as ‘non-gentrified’ (0). This approach helps pinpoint the tracts most affected by gentrification based on significant socio-economic transformations. By lowering the percentile threshold for income and home value increases, we include areas undergoing earlier stages of gentrification, which are crucial for timely policy interventions.
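The classification rule described in this section can be sketched in base R. This is a minimal illustration, not the report's actual code: the function name flag_gentrified, the toy data frame, and the baseline columns homeValue15 and income15 are hypothetical, and the sketch assumes that the 40th percentile is taken over whatever set of tracts is passed in and that the change thresholds are percentiles of the 2015-2020 changes.

```r
# Hypothetical helper: returns 1 for tracts that are eligible (low baseline
# home value and income) and whose changes all exceed the p-th percentile.
flag_gentrified <- function(df, p = 0.40) {
  # Eligibility: baseline home value and income below the p-th percentile
  eligible <- df$homeValue15 < quantile(df$homeValue15, p, na.rm = TRUE) &
    df$income15 < quantile(df$income15, p, na.rm = TRUE)

  # A change counts as "large" when it exceeds the p-th percentile of changes
  above <- function(x) x > quantile(x, p, na.rm = TRUE)

  as.integer(
    eligible &
      above(df$changeinhouseprice) &
      above(df$incomeChange) &
      above(df$changeinbachelor)
  )
}

# Toy data for five made-up tracts
toy_tracts <- data.frame(
  homeValue15        = c(100, 200, 300, 400, 500),
  income15           = c(10, 20, 30, 40, 50),
  changeinhouseprice = c(50, 5, 40, 30, 20),
  incomeChange       = c(5, 1, 4, 3, 2),
  changeinbachelor   = c(0.05, 0.01, 0.04, 0.03, 0.02)
)

toy_tracts$gentrification <- flag_gentrified(toy_tracts)
print(toy_tracts$gentrification)
```

In this toy example only the first tract is both eligible and above all three change thresholds. Raising p toward 0.60 would move the change thresholds closer to the original NCRC setting mentioned above.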

## Reading layer `CityBoundaryofLosAngeles' from data source 
##   `/Users/georgechen/Documents/GitHub/PPA_Final/CityBoundaryofLosAngeles.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -118.6682 ymin: 33.70365 xmax: -118.1554 ymax: 34.33731
## Geodetic CRS:  WGS 84

4 Feature Engineering

4.1 Census data

Because existing studies point to the impact of historical racial segregation (Hwang & Sampson, 2014) and migration (Hwang, 2015) on gentrification, we pay particular attention to white and minority household ownership proportions and migration rates. Following DeVylder et al. (2019), we also consider additional demographic variables, including gender, age group, annual household income, and education level. We converted the raw census counts into percentages for critical metrics. This normalization allows us to assess gentrification impacts relative to the total population of each tract, ensuring comparability across diverse geographic regions. We then computed changes in key socioeconomic indicators over time (2015-2020) to capture the dynamics of gentrification. These include changes in poverty levels, educational attainment, and racial demographics.
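As a rough illustration of this normalization step, the sketch below converts raw counts into shares of the tract population and then takes the 2015-2020 difference. The input column names (totalPop15, bachelors15, and so on) and the values are made up for the example; only pctBachelors20 and changeinbachelor correspond to variable names used later in this report.

```r
# Toy census extract for two hypothetical tracts
census <- data.frame(
  GEOID       = c("t1", "t2"),
  totalPop15  = c(4000, 2500),
  bachelors15 = c(800, 250),
  totalPop20  = c(4200, 2600),
  bachelors20 = c(1260, 260)
)

# Share of residents with at least a bachelor's degree in each year
census$pctBachelors15 <- census$bachelors15 / census$totalPop15
census$pctBachelors20 <- census$bachelors20 / census$totalPop20

# Change in that share between 2015 and 2020
census$changeinbachelor <- census$pctBachelors20 - census$pctBachelors15

print(census[, c("GEOID", "pctBachelors20", "changeinbachelor")])
```

Working with shares rather than counts keeps a large tract from looking like it changed more simply because it has more residents.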

4.2 Amenity Accessibility

We conducted spatial joins between census tracts and points of interest to count the number of amenities within each tract. This process included grocery stores, housing permits (indicative of development activity), restaurants, and educational facilities, providing a lens into the changing infrastructural landscape which often accompanies or signals gentrification.
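A minimal sketch of counting points within polygons with the sf package is shown below. The two square "tracts" and three "store" points are toy geometries without a coordinate reference system, and the column name grocery_count is an assumption for illustration; real inputs would be the tract boundaries and amenity layers described above, projected to a common CRS.

```r
library(sf)

# Two hypothetical square tracts (toy coordinates, not real LA geometry)
tracts <- st_sf(
  GEOID = c("A", "B"),
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
    st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
  )
)

# Three hypothetical grocery-store locations
stores <- st_as_sf(
  data.frame(x = c(0.2, 0.7, 1.5), y = c(0.5, 0.5, 0.5)),
  coords = c("x", "y")
)

# st_intersects() lists, for each tract, the points that fall inside it;
# lengths() turns that list into a per-tract count
tracts$grocery_count <- lengths(st_intersects(tracts, stores))

print(st_drop_geometry(tracts))
```

The same pattern can be repeated for each amenity layer, giving one count column per amenity type.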

4.3 Visualizing amenity patterns

5 Exploratory Analysis - Gentrification

5.1 Numeric Variables

The following figures are bar charts that represent the mean value of each numeric variable in the dataset.

5.2 Crime Data Integration

To understand the relationship between safety and gentrification, we compiled crime statistics for 2015 and 2020 and calculated the change over time to identify any correlation between gentrification and crime rates. The following maps illustrate the change in crime density across Los Angeles from 2015 to 2020. Crime density decreased in the upper town area, within tracts that had relatively high crime counts. This rich dataset allows for a nuanced analysis of how various factors contribute to or are affected by gentrification, providing city planners with actionable insights into urban development processes.
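The per-tract crime change can be sketched as follows. The sketch assumes that each incident record has already been assigned to a tract (for example through a spatial join), and the incident table and the column names crime15 and crime20 are invented for illustration; crimeChange matches the variable name used in the model.

```r
# Toy incident table: one row per reported crime, already tagged with a tract
crime <- data.frame(
  GEOID = c("t1", "t1", "t1", "t2", "t2", "t3"),
  year  = c(2015, 2015, 2020, 2015, 2020, 2020)
)

# Cross-tabulate incidents by tract and year
counts <- table(crime$GEOID, crime$year)

crime_by_tract <- data.frame(
  GEOID   = rownames(counts),
  crime15 = as.vector(counts[, "2015"]),
  crime20 = as.vector(counts[, "2020"])
)

# Positive values mean more incidents in 2020 than in 2015
crime_by_tract$crimeChange <- crime_by_tract$crime20 - crime_by_tract$crime15

print(crime_by_tract)
```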

6 Logistic regression

Because we measured gentrification as a binary outcome, we chose logistic regression to predict gentrification in Los Angeles: it models the probability of each census tract becoming gentrified based on socio-economic and demographic predictors. The final model uses the following outcome and predictor variables:

  • gentrification: Binary outcome indicating whether a tract has gentrified (0 = no, 1 = yes)

  • incomeChange: Change in income from 2015 to 2020

  • changeinpoverty: Change in percentage of poverty from 2015 to 2020

  • ForMig_Change: Change in percentage of migration from 2015 to 2020

  • changeinbachelor: Change in percentage of people with at least a bachelor’s degree from 2015 to 2020

  • houseprice20: Median house price per tract in 2020

  • changeinhouseprice: Change in median house price per tract from 2015 to 2020

  • changein2544: Change in population aged 25 to 44 per tract from 2015 to 2020

  • rent20: Median rent price per tract in 2020

  • rent: Median rent price per tract in 2015

  • changeinwhite: Change in percentage of white population from 2015 to 2020

  • pctBachelors20: Percentage of people with at least a bachelor’s degree in 2020

  • newhousingunit: New housing units from 2015 to 2020

  • race: The racial context of the tract (dominantly white or minority)

  • crimeChange: Change in the number of crimes between 2015 and 2020

  • Density_Change: Change in population density between 2015 and 2020

The TOD and amenity features did not contribute meaningfully to model accuracy, so we dropped them from the final model.
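The general shape of such a model can be sketched with glm() on simulated data. The data below are randomly generated and use only two of the predictors, so the coefficients carry no meaning for Los Angeles; the 0.15 cut-off mirrors the 15% threshold used later in the confusion matrix.

```r
set.seed(42)
n <- 300

# Simulated predictors (values are arbitrary, for illustration only)
toy <- data.frame(
  changeinhouseprice = rnorm(n, mean = 50000, sd = 30000),
  incomeChange       = rnorm(n, mean = 5000, sd = 4000)
)

# Simulated binary outcome whose log-odds rise with house price change
toy$gentrification <- rbinom(n, 1, plogis(-3 + toy$changeinhouseprice / 40000))

# Logistic regression: gentrification against all other columns
fit <- glm(gentrification ~ ., family = binomial(link = "logit"), data = toy)

# Predicted probability for each tract, then a class label at a 15% threshold
probs     <- predict(fit, newdata = toy, type = "response")
predicted <- as.integer(probs > 0.15)

print(round(coef(fit), 6))
print(table(observed = toy$gentrification, predicted = predicted))
```

A low threshold such as 15% is a common choice when the positive class is rare, since a 50% cut-off would label almost every tract as non-gentrified.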

6.1 Model building

Term                     Estimate     Std. Error     z value
(Intercept)             3.3638534      4.4438213   0.7569731
changeinpoverty        11.5353439     15.6267896   0.7381775
changeinbachelor      -28.5774509    230.2222616  -0.1241298
changeinwhite           0.0412974      6.1224493   0.0067452
pctBachelors20       -190.9230455    357.3621565  -0.5342565
changein2544          -11.7804180     38.6451691  -0.3048355
houseprice20           -0.0000337      0.0000186  -1.8091612
rent                   -0.0029736      0.0051635  -0.5758938
rent20                  0.0011469      0.0027592   0.4156756
changeinhouseprice      0.0000674      0.0000325   2.0714647
newhousingunit         -0.0053349      0.0092816  -0.5747883
incomeChange            0.0001137      0.0001203   0.9453091
crimeChange            -0.0126408      0.0232188  -0.5444215
ForMig_Change          29.5252539     76.2806294   0.3870610
Density_Change       -755.1584616   8317.1253333  -0.0907956
raceWhite               0.9545769      1.7935846   0.5322174

## 
## Call:
## glm(formula = gentrification ~ ., family = binomial(link = "logit"), 
##     data = gentrifydataTrain %>% dplyr::select(gentrification, 
##         changeinpoverty, changeinbachelor, changeinwhite, pctBachelors20, 
##         changein2544, houseprice20, rent, rent20, changeinhouseprice, 
##         newhousingunit, incomeChange, crimeChange, ForMig_Change, 
##         Density_Change, race))
## 
## Coefficients:
##                         Estimate    Std. Error z value Pr(>|z|)  
## (Intercept)           3.36385339    4.44382128   0.757   0.4491  
## changeinpoverty      11.53534393   15.62678957   0.738   0.4604  
## changeinbachelor    -28.57745090  230.22226158  -0.124   0.9012  
## changeinwhite         0.04129735    6.12244932   0.007   0.9946  
## pctBachelors20     -190.92304553  357.36215654  -0.534   0.5932  
## changein2544        -11.78041802   38.64516911  -0.305   0.7605  
## houseprice20         -0.00003367    0.00001861  -1.809   0.0704 .
## rent                 -0.00297365    0.00516353  -0.576   0.5647  
## rent20                0.00114693    0.00275920   0.416   0.6776  
## changeinhouseprice    0.00006742    0.00003255   2.071   0.0383 *
## newhousingunit       -0.00533493    0.00928155  -0.575   0.5654  
## incomeChange          0.00011369    0.00012026   0.945   0.3445  
## crimeChange          -0.01264079    0.02321875  -0.544   0.5862  
## ForMig_Change        29.52525395   76.28062941   0.387   0.6987  
## Density_Change     -755.15846164 8317.12533330  -0.091   0.9277  
## raceWhite             0.95457694    1.79358457   0.532   0.5946  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 46.13  on 72  degrees of freedom
## Residual deviance: 23.92  on 57  degrees of freedom
##   (43 observations deleted due to missingness)
## AIC: 55.92
## 
## Number of Fisher Scoring iterations: 9

The model has a McFadden pseudo-R² of 0.48, suggesting a good fit to the training data.
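The McFadden value can be checked by hand from the two log-likelihoods reported in this section's pseudo-R² output, using the standard definition 1 - llh / llhNull. The sketch below simply plugs in those reported numbers; the variable names follow the output column labels.

```r
# Log-likelihoods copied from the pseudo-R2 output
llh     <- -11.9597969  # fitted model
llhNull <- -23.0649552  # intercept-only (null) model

# McFadden pseudo-R2: proportional improvement in log-likelihood over the null
mcfadden <- 1 - llh / llhNull

# Likelihood-ratio statistic (reported as G2)
g2 <- -2 * (llhNull - llh)

print(round(mcfadden, 4))
print(round(g2, 4))
```

Both values agree with the McFadden and G2 columns of the output to the reported precision.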

## fitting null model for pseudo-r2
##         llh     llhNull          G2    McFadden        r2ML        r2CU 
## -11.9597969 -23.0649552  22.2103165   0.4814732   0.2623242   0.5600098
##     Outcome      Probs
## 117       0 0.01834381
## 118       0 0.16846636
## 119       0 0.03480654
## 120       0 0.09365045
## 121       0 0.01092408
## 122       0         NA

We can see that our model is better at predicting the negatives than the positives.

6.2 Confusion matrix

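The confusion matrix compares observed and predicted gentrification outcomes. Each entry counts the tracts for one combination of observed and predicted class, using a 15% probability threshold. The overall accuracy is 81%. This is below the no-information rate of 92% because non-gentrified tracts dominate the sample, so sensitivity (62%) and specificity (83%) are more informative measures here.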

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 512  21
##          1 105  34
##                                             
##                Accuracy : 0.8125            
##                  95% CI : (0.7809, 0.8413)  
##     No Information Rate : 0.9182            
##     P-Value [Acc > NIR] : 1                 
##                                             
##                   Kappa : 0.2642            
##                                             
##  Mcnemar's Test P-Value : 0.0000000000001422
##                                             
##             Sensitivity : 0.61818           
##             Specificity : 0.82982           
##          Pos Pred Value : 0.24460           
##          Neg Pred Value : 0.96060           
##              Prevalence : 0.08185           
##          Detection Rate : 0.05060           
##    Detection Prevalence : 0.20685           
##       Balanced Accuracy : 0.72400           
##                                             
##        'Positive' Class : 1                 
## 
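The summary statistics above follow directly from the four cell counts of the matrix. The short sketch below recomputes them, treating class 1 (gentrified) as the positive class as in the output; the variable names are chosen for this example.

```r
# Cell counts copied from the confusion matrix (positive class = 1)
tp <- 34   # predicted 1, observed 1
fn <- 21   # predicted 0, observed 1
fp <- 105  # predicted 1, observed 0
tn <- 512  # predicted 0, observed 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # share of gentrified tracts that were caught
specificity <- tn / (tn + fp)  # share of non-gentrified tracts correctly cleared
ppv         <- tp / (tp + fp)  # share of predicted positives that were correct

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv), 4))
```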

6.3 ROC Curve

The area under the ROC curve (AUC) for our model is 0.83, suggesting that the model discriminates well between gentrified and non-gentrified tracts using the engineered features.

## Area under the curve: 0.8348
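For intuition, the AUC equals the probability that a randomly chosen gentrified tract receives a higher predicted probability than a randomly chosen non-gentrified tract. A minimal base R sketch of that rank-based calculation is given below; the function name auc_rank and the four toy observations are hypothetical, and this is not necessarily how the value above was computed.

```r
# Rank-based (Mann-Whitney) AUC for a binary outcome coded 0/1
auc_rank <- function(outcome, probs) {
  r     <- rank(probs)  # average ranks handle ties
  n_pos <- sum(outcome == 1)
  n_neg <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: three of the four positive/negative pairs are ordered correctly
toy_outcome <- c(0, 0, 1, 1)
toy_probs   <- c(0.10, 0.40, 0.35, 0.80)

print(auc_rank(toy_outcome, toy_probs))
```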

6.4 Cross Validation

## [1] "No"  "Yes"
## Generalized Linear Model 
## 
## 745 samples
##  15 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 671, 670, 671, 671, 669, 671, ... 
## Resampling results:
## 
##   ROC       Sens       Spec    
##   0.880112  0.9780691  0.147619

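The figure below plots the distribution of AUC, sensitivity, and specificity across the 10 folds. The tighter each distribution is around its mean, the more generalizable the model. Based on the results, the model generalizes well in terms of ROC (0.88) and sensitivity (0.98), but specificity is low (0.15). Note that because "No" is listed as the first class, sensitivity here likely refers to correctly identifying non-gentrified tracts.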
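The idea behind k-fold cross-validation can be sketched in base R as follows. This is a simplified stand-in for the resampling shown in the output above: it uses simulated data, a single predictor, and per-fold accuracy at a 0.5 cut-off rather than ROC, sensitivity, and specificity.

```r
set.seed(1)
n <- 200

# Simulated data with one predictor (illustration only)
toy   <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + 1.5 * toy$x))

# Randomly assign each row to one of k folds of near-equal size
k    <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, evaluate on the held-out fold, repeat for every fold
cv_accuracy <- sapply(seq_len(k), function(i) {
  fit <- glm(y ~ x, family = binomial(link = "logit"), data = toy[fold != i, ])
  p   <- predict(fit, newdata = toy[fold == i, ], type = "response")
  mean((p > 0.5) == (toy$y[fold == i] == 1))
})

print(round(cv_accuracy, 3))
print(round(c(mean = mean(cv_accuracy), sd = sd(cv_accuracy)), 3))
```

A small standard deviation across folds is the numerical counterpart of the "tight distribution" criterion described above.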

6.5 Generalizability under Racial Context

We further examined the model's generalizability across racial contexts. More dominantly minority tracts than dominantly white tracts gentrified between 2015 and 2020. The model is more accurate for dominantly white census tracts, especially in identifying tracts that are not at risk of gentrification, whereas it is better at correctly identifying dominantly minority tracts that are at risk. Overall, however, the model over-predicts gentrification in both dominantly white and dominantly minority tracts.

7 Validation on Chicago

The validation of our predictive model using Chicago’s data, spanning the same years as the Los Angeles dataset, serves as a strategic choice to test the model’s applicability and robustness across different urban settings. Choosing Chicago for validating our gentrification model developed from Los Angeles data is predicated on several factors:

  • Similar Urban Dynamics: Both Chicago and Los Angeles are major metropolitan areas with diverse populations and significant economic disparities across different neighborhoods.

  • Prevalence of Gentrification: Like Los Angeles, Chicago has experienced noticeable gentrification, particularly in neighborhoods close to the city center and along key transit routes.

  • Comprehensive Data Availability: Chicago, similar to Los Angeles, has extensive data resources on demographics, urban infrastructure, and socio-economic indicators, which are crucial for a fair comparison and reliable validation of the predictive model.

This approach helps ascertain the model’s generalizability and accuracy in predicting gentrification beyond the initial city, ensuring that the model can potentially be adapted for various urban environments with similar underlying patterns of change. The result of this step of validation shows that the model

7.1 Loading census data

7.2 Measuring Gentrification

7.3 Feature Engineering

7.3.1 Identify Changes in Census Data

7.3.2 Calculate Crime by Tracts

7.4 Model building

7.4.1 Logistic regression

Term                     Estimate     Std. Error     z value
(Intercept)             2.0342661      1.4166172   1.4360027
changeinpoverty        -0.2119973      2.1994346  -0.0963872
changeinbachelor       12.2119027     46.2968824   0.2637738
changeinwhite          -4.0937686      2.3451435  -1.7456367
pctBachelors20         -6.0007187     56.6144464  -0.1059927
changein2544          -25.3416988     10.3766850  -2.4421767
houseprice20           -0.0000361      0.0000066  -5.4484393
rent                    0.0029189      0.0019286   1.5134595
rent20                 -0.0029450      0.0017012  -1.7311588
changeinhouseprice      0.0000467      0.0000093   5.0089634
newhousingunit         -0.0018120      0.0020004  -0.9058224
incomeChange            0.0000995      0.0000261   3.8167922
crimeChangechi          0.0005312      0.0036222   0.1466597
ForMig_Change          21.1483772     11.3118569   1.8695761
Density_Change        361.2075557    207.6889392   1.7391757
raceWhite               0.0352759      0.5823814   0.0605719

## 
## Call:
## glm(formula = gentrification ~ ., family = binomial(link = "logit"), 
##     data = gentrifydatachiTrain %>% dplyr::select(gentrification, 
##         changeinpoverty, changeinbachelor, changeinwhite, pctBachelors20, 
##         changein2544, houseprice20, rent, rent20, changeinhouseprice, 
##         newhousingunit, incomeChange, crimeChangechi, ForMig_Change, 
##         Density_Change, race))
## 
## Coefficients:
##                         Estimate    Std. Error z value     Pr(>|z|)    
## (Intercept)          2.034266071   1.416617207   1.436     0.151002    
## changeinpoverty     -0.211997279   2.199434557  -0.096     0.923213    
## changeinbachelor    12.211902711  46.296882390   0.264     0.791954    
## changeinwhite       -4.093768628   2.345143519  -1.746     0.080874 .  
## pctBachelors20      -6.000718738  56.614446382  -0.106     0.915588    
## changein2544       -25.341698754  10.376685009  -2.442     0.014599 *  
## houseprice20        -0.000036144   0.000006634  -5.448 0.0000000508 ***
## rent                 0.002918906   0.001928631   1.513     0.130163    
## rent20              -0.002945022   0.001701186  -1.731     0.083423 .  
## changeinhouseprice   0.000046677   0.000009319   5.009 0.0000005472 ***
## newhousingunit      -0.001812005   0.002000398  -0.906     0.365030    
## incomeChange         0.000099507   0.000026071   3.817     0.000135 ***
## crimeChangechi       0.000531233   0.003622218   0.147     0.883401    
## ForMig_Change       21.148377205  11.311856871   1.870     0.061543 .  
## Density_Change     361.207555712 207.688939249   1.739     0.082004 .  
## raceWhite            0.035275941   0.582381440   0.061     0.951700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 262.14  on 432  degrees of freedom
## Residual deviance: 157.43  on 417  degrees of freedom
##   (126 observations deleted due to missingness)
## AIC: 189.43
## 
## Number of Fisher Scoring iterations: 8

7.4.2 Model Evaluation

## fitting null model for pseudo-r2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
##  -78.7125161 -131.0682738  104.7115153    0.3994541    0.2148088    0.4729992
##    Outcome                    Probs
## 2        0                       NA
## 3        0 0.0000000000000002220446
## 5        0 0.0000008986184847365930
## 7        0 0.0000392007163883749429
## 8        0 0.0478297642928442734434
## 10       0 0.0000433827326216929180

7.4.3 Confusion matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 189   5
##          1  37  13
##                                          
##                Accuracy : 0.8279         
##                  95% CI : (0.7745, 0.873)
##     No Information Rate : 0.9262         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3072         
##                                          
##  Mcnemar's Test P-Value : 0.000001724    
##                                          
##             Sensitivity : 0.8363         
##             Specificity : 0.7222         
##          Pos Pred Value : 0.9742         
##          Neg Pred Value : 0.2600         
##              Prevalence : 0.9262         
##          Detection Rate : 0.7746         
##    Detection Prevalence : 0.7951         
##       Balanced Accuracy : 0.7793         
##                                          
##        'Positive' Class : 0              
## 

8 Conclusion

Based on the model, we have some recommendations for implementation:

1. Projecting Five-Year Scenarios: The model is calibrated to predict changes based on a five-year historical data pattern, so it is best suited to forecasting similar five-year intervals. We recommend using it to project outcomes from 2020 to 2025, and similarly for subsequent five-year periods. This aligns the model’s strengths with its intended application and keeps its predictions relevant.

2. Regular Data Updates and Community Engagement: For the model to remain effective, it is crucial to incorporate the latest data regularly. As urban demographics and economic conditions evolve, updating the dataset for the next five-year span (e.g., 2025-2030) as soon as new data become available will enable the model to capture emerging trends and shifts in urban development. This continuous updating not only enhances the model’s accuracy but also maintains its relevance in dynamic urban planning contexts.

3. Model Customization for Highly Gentrified Cities: Cities that have undergone significant gentrification, such as New York, present unique challenges that a general model may not fully address. In these cases, it is advisable to develop specialized models that consider specific local factors and the saturation of gentrification effects. Such tailored models should focus on more granular aspects of change, such as shifts in micro-neighborhood demographics or the impact of policy changes, to provide useful insights for urban planners and policymakers.

4. Incorporating Localized Factors: Consider enhancing the model by integrating more localized factors that influence gentrification, such as zoning laws, public transportation developments, and economic incentives. These additions can improve the model’s ability to forecast gentrification impacts within specific contexts.

In summary, the predictive model is a tool designed to identify and forecast gentrification trends within urban areas. It leverages both spatial and non-spatial data from the American Community Survey, covering demographic shifts and socio-economic changes from 2015 to 2020. We recommend that the city invest in further developing this model to improve the distribution of socioeconomic resources.

9 References

DeVylder, J., Fedina, L., & Jun, H.-J. (2019). The Neighborhood Change and Gentrification Scale: Factor analysis of a novel self-report measure. Social Work Research, 43(4), 279–284. https://doi.org/10.1093/swr/svz015

Hwang, J. (2015). Gentrification in changing cities: Immigration, new diversity, and racial inequality in neighborhood renewal. The Annals of the American Academy of Political and Social Science, 660(1), 319–340.

Hwang, J. (2016). The social construction of a gentrifying neighborhood: Reifying and redefining identity and boundaries in inequality. Urban Affairs Review, 52(1), 98–128.

Hwang, J., & Sampson, R. J. (2014). Divergent pathways of gentrification: Racial inequality and the social order of renewal in Chicago neighborhoods. American Sociological Review, 79(4), 726–751.

Richardson, J., Mitchell, B., & Edlebi, J. (2020). Gentrification and disinvestment 2020.